Finding a standard dataset format for machine learning

Exploring new dataset format options for OpenML.org
Categories: OpenML, Data

Authors: Pieter Gijsbers, Mitar Milutinovic, Prabhant Singh, Joaquin Vanschoren

Published: March 23, 2020

Machine learning data is commonly shared in whatever form it comes in (e.g. images, logs, tables), without any strict assumptions on what it contains or how it is formatted. This makes machine learning harder, because you need to spend a lot of time figuring out how to parse and handle each dataset. Some datasets are accompanied by loading scripts, which are language-specific and may break, and some come with their own server for querying the data. These do help, but they are often not available, and they still require us to handle every dataset individually.

With OpenML, we aim to take a stress-free, 'zen'-like approach to working with machine learning datasets. To make training data easy to use, OpenML serves thousands of datasets in the same format, with the same rich meta-data, so that you can directly load them (e.g. into numpy, pandas, …) and start building models without manual intervention. For instance, you can benchmark algorithms across hundreds of datasets in a simple loop.
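Such a loop can be as small as the sketch below, using the openml Python package together with scikit-learn; the dataset IDs are just illustrative examples.

```python
# A minimal sketch of benchmarking one algorithm across several OpenML datasets.
import openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

for dataset_id in [61, 37, 1464]:  # any list of OpenML dataset IDs
    dataset = openml.datasets.get_dataset(dataset_id)
    # Features and target come back ready to use (pandas objects in recent
    # versions), with data types taken from the dataset's meta-data.
    X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute)
    score = cross_val_score(RandomForestClassifier(), X, y, cv=3).mean()
    print(dataset.name, round(score, 3))
```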

For historical reasons, we have done this by internally storing all data in the ARFF format, a CSV-like text-based format that includes meta-data such as the correct feature data types. However, this format is loosely defined, causing different parsers to behave differently, and the current parsers are memory-inefficient, which inhibits the use of large datasets. A more popular format these days is Parquet, a binary single-table format. However, many current machine learning tasks require multi-table data. For instance, image segmentation and object detection tasks have both images and a varying number of annotations per image.
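To illustrate what that meta-data looks like, an ARFF header declares each feature's type up front; the toy example below is parsed with scipy's ARFF reader.

```python
# Toy ARFF example: the header declares the type of every feature,
# which is exactly the meta-data that plain CSV lacks.
import io
from scipy.io import arff

arff_text = (
    "@RELATION weather\n"
    "@ATTRIBUTE temperature NUMERIC\n"
    "@ATTRIBUTE outlook {sunny,overcast,rainy}\n"
    "@DATA\n"
    "21.5,sunny\n"
    "17.0,rainy\n"
)

data, meta = arff.loadarff(io.StringIO(arff_text))
print(meta)             # attribute names and their declared types
print(data["outlook"])  # nominal values (returned as bytes by scipy's parser)
```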

In short, we are looking for the best format in which to internally store machine learning datasets for the foreseeable future, so that we can extend OpenML to all kinds of modern machine learning datasets and serve them in a uniform way. This blog post presents our process and insights. We would love to hear your thoughts and experiences before we make any decision on how to move forward.

Scope

We first define the general scope in which the format will be used:

Impact on OpenML (simplicity, maintenance)

Since OpenML is a community project, we want to keep it as easy as possible to use and maintain:

When no agreed-upon schema exists, we could offer a forum for the community to discuss and agree on a standard schema, in collaboration with other initiatives (e.g. frictionlessdata). For instance, new schemas could be created in a GitHub repository so that people can propose them through pull requests; they would effectively be in use once merged.
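As a purely illustrative sketch (not an agreed-upon OpenML schema), such a shared schema could resemble frictionlessdata's Table Schema, describing each field and its type:

```python
# Hypothetical example, only to illustrate what a shared schema might contain;
# the field names and types are made up.
weather_schema = {
    "name": "weather",
    "fields": [
        {"name": "temperature", "type": "number"},
        {"name": "outlook", "type": "string",
         "constraints": {"enum": ["sunny", "overcast", "rainy"]}},
        {"name": "play", "type": "boolean"},
    ],
}
```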

Requirements

To draw up a shortlist of data formats, we used the following (soft) requirements:

Shortlist

We decided to investigate the following formats in more detail:

Arrow / Feather

Benefits:

Drawbacks:
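For reference, a Feather (Arrow IPC) round trip from pandas is a one-liner in each direction; this is a sketch and assumes pyarrow is installed.

```python
# Feather round trip via pandas; requires pyarrow.
import pandas as pd

df = pd.DataFrame({"temperature": [21.5, 17.0], "outlook": ["sunny", "rainy"]})
df.to_feather("weather.feather")
df2 = pd.read_feather("weather.feather")  # column dtypes are preserved
```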

Parquet

Benefits:

Drawbacks:
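For reference, a single-table Parquet round trip from pandas looks like the sketch below, using pyarrow or fastparquet as the engine.

```python
# Single-table Parquet round trip via pandas.
import pandas as pd

df = pd.DataFrame({"temperature": [21.5, 17.0], "outlook": ["sunny", "rainy"]})
df.to_parquet("weather.parquet")
df2 = pd.read_parquet("weather.parquet")  # column types travel with the file
```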

SQLite

Benefits:

Drawback:
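For reference, multi-table data (e.g. images plus a varying number of annotations) fits naturally in a single SQLite file; a sketch using pandas and the standard library:

```python
# Two related tables stored in one SQLite file, queried back with a join.
import sqlite3
import pandas as pd

images = pd.DataFrame({"image_id": [1, 2], "path": ["img1.png", "img2.png"]})
annotations = pd.DataFrame({"image_id": [1, 1, 2], "label": ["cat", "dog", "cat"]})

with sqlite3.connect("dataset.sqlite") as con:
    images.to_sql("images", con, index=False, if_exists="replace")
    annotations.to_sql("annotations", con, index=False, if_exists="replace")
    joined = pd.read_sql(
        "SELECT * FROM images JOIN annotations USING (image_id)", con
    )
```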

HDF5

Benefits:

Drawbacks:
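For reference, HDF5 stores tensors and their metadata together and allows sliced reads; a sketch using h5py:

```python
# A tensor plus attached metadata in one HDF5 file, with sliced reads.
import numpy as np
import h5py

with h5py.File("images.h5", "w") as f:
    dset = f.create_dataset("images", data=np.zeros((100, 28, 28), dtype="uint8"))
    dset.attrs["description"] = "toy image tensor"  # metadata lives with the data

with h5py.File("images.h5", "r") as f:
    first_image = f["images"][0]                    # read only one slice
    description = f["images"].attrs["description"]
```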

CSV

Benefits:

Drawbacks:
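For reference, a CSV round trip from pandas; note that the feature types have to be re-inferred (or passed explicitly) on load, because the file itself carries no type meta-data.

```python
# CSV round trip via pandas; the file stores no feature types.
import pandas as pd

df = pd.DataFrame({"temperature": [21.5, 17.0], "outlook": ["sunny", "rainy"]})
df.to_csv("weather.csv", index=False)
df2 = pd.read_csv("weather.csv", dtype={"outlook": "category"})  # types re-specified
```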

Overview

|  | Parquet | HDF5 | SQLite | CSV |
|---|---|---|---|---|
| Consistency across different platforms |  |  | ? | ✅ (dialect) |
| Support and documentation |  |  |  |  |
| Read/write speed |  |  | so-so |  |
| Incremental reads/writes | Yes, but not supported by current Python libs |  |  | Yes (but not random access) |
| Supports very large and high-dimensional datasets |  |  | ❌ (limited nr. of columns per table) | ✅ (storing tensors requires flattening) |
| Simplicity |  | ❌ (basically a full file system) | ✅ (it's a database) |  |
| Metadata support | Only minimal |  |  | ❌ (requires a separate metadata file) |
| Maintenance | Apache project, open and quite active | Closed group, but active community on Jira and conferences | Run by a company; uses an email list |  |
| Available examples of usage in ML |  |  |  |  |
| Flexibility | Only tabular | Very flexible, maybe too flexible | Relational, multi-table | Only tabular |
| Versioning/Diff | Only via S3 or Delta Lake |  |  |  |
| Different length vectors |  |  | As blob | ❌ ? |

Performance benchmarks

There are some prior benchmarks (here and here) on storing dataframes, but they only consider single-table datasets. For reading and writing, CSV is clearly slower and Parquet is clearly faster. For storage, Parquet is the most efficient, although zipped CSV is also compact; HDF5 requires a lot more disk space. We also ran our own benchmark to compare the writing performance of these formats on very large and complex machine learning datasets, but we could not find a way to store such datasets in a single Parquet file.
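Our benchmark script is not reproduced here, but a minimal single-table timing comparison could look like the sketch below; the data and file names are illustrative, `to_parquet` needs pyarrow or fastparquet, and `to_hdf` needs PyTables.

```python
# Illustrative write-speed comparison for a single-table dataset; a sketch,
# not the benchmark referenced above.
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 20),
                  columns=[f"f{i}" for i in range(20)])

writers = {
    "csv": lambda: df.to_csv("bench.csv", index=False),
    "parquet": lambda: df.to_parquet("bench.parquet"),
    "hdf5": lambda: df.to_hdf("bench.h5", key="data", mode="w"),
}
for name, write in writers.items():
    start = time.perf_counter()
    write()
    print(f"{name}: {time.perf_counter() - start:.2f}s")
```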

Version control

Version control for large datasets is tricky. For text-based formats (e.g. CSV), we could use git LFS to store the datasets and get automated versioning. We found it quite easy to export all current OpenML datasets to GitLab: https://gitlab.com/data/d/openml.

The binary formats do not allow us to track changes in the data; they only let us recover the exact versions of the datasets (and their metadata) that we want. Extra tools could still be used to export the data to dataframes or text and then compare them. Delta Lake has version history support, but seemingly only for Spark operations performed on the datasets.
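For example, two versions exported to dataframes can be diffed directly in pandas; in the sketch below the file names are illustrative, and `DataFrame.compare` requires identically labelled frames.

```python
# Diffing two exported versions of a dataset with pandas.
import pandas as pd

v1 = pd.read_parquet("dataset_v1.parquet")
v2 = pd.read_parquet("dataset_v2.parquet")
changed = v1.compare(v2)  # only the cells that differ between versions
print(changed)
```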

We need your help! If we have missed any format we should investigate, misunderstood those we did investigate, or overlooked some best practice, please tell us. You are welcome to comment below or send us an email at openmlhq@googlegroups.com.

Contributors to this blog post: Mitar Milutinovic, Prabhant Singh, Joaquin Vanschoren, Pieter Gijsbers, Andreas Mueller, Matthias Feurer, Jan van Rijn, Marcus Weimer, Marcel Wever, Gertjan van den Burg, Nick Poorman